When we loaded the raw data, we found that many categories keep showing up across different seasons. This made us curious about how frequently different categories appear and which 5 categories appear most often. There are also some “biology” people in our group, and we were curious about mythology, so we decided to explore that topic in depth by searching for the names of gods, which seemed like fun. While working on the top 5 categories, we found an interesting phenomenon: the number of questions drops sharply after 1998, which made us wonder whether some limitation was introduced around 1998. We include a figure to show this drop.
Our research asks which categories appear most often across the 35 seasons that Jeopardy has run, and whether the top 5 categories have changed over time. We then look at the categories in which certain gods appear most often, with a more in-depth analysis of Zeus and Athena within these categories.
Question 1: What are the top 5 categories in all seasons, and how often do these occur per year?
Question 2: What are the top 5 categories in which different kinds of gods (Greek, Norse, Hindu) are mentioned, in either the question or the answer? How often are the popular Greek gods Zeus and Athena mentioned over time and within each of these categories?
We read in the data and cleaned it by separating the air date into three columns, removing comments and notes, converting all text to lowercase, and removing data from 2019, as that season is only partially complete.
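The cleaning steps above can be sketched roughly as follows. This is a minimal Python illustration, not the R code used in the project. The field names (`air_date`, `category`, `question`, `answer`, `comments`, `notes`), the `YYYY-MM-DD` date format, and the two toy rows are all assumptions made for the example.

```python
from datetime import date

def clean_rows(rows):
    """Return cleaned copies of raw question rows (hypothetical schema)."""
    cleaned = []
    for row in rows:
        # Assumes the air date is stored as "YYYY-MM-DD".
        air = date.fromisoformat(row["air_date"])
        if air.year == 2019:
            continue  # the 2019 season is only partially complete
        cleaned.append({
            # Air date split into three separate columns.
            "year": air.year,
            "month": air.month,
            "day": air.day,
            # Text converted to lowercase.
            "category": row["category"].strip().lower(),
            "question": row["question"].strip().lower(),
            "answer": row["answer"].strip().lower(),
            # The assumed "comments" and "notes" fields are not carried over.
        })
    return cleaned

# Toy rows with placeholder text, for illustration only.
raw = [
    {"air_date": "1997-03-14", "category": "SCIENCE",
     "question": "Placeholder Clue", "answer": "Placeholder",
     "comments": "-", "notes": "-"},
    {"air_date": "2019-01-02", "category": "HISTORY",
     "question": "Placeholder Clue", "answer": "Placeholder",
     "comments": "-", "notes": "-"},
]
cleaned = clean_rows(raw)
```

Only the 1997 row survives here, with its date split into `year`, `month`, and `day` and its text in lowercase.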
Methods:
- We cleaned the data by separating the air date into three columns and removing comments and notes. We then checked the dataframe and looked at the occurrence of categories over all seasons using group_by and filtering, and we visualized the top 5 categories with a plot of the number of questions asked in each.
- To explore the top 5 further, we made a plot showing the proportion of questions in each of the top 5 categories by year/season. In this plot we found a large drop in the total number of questions asked.
- To explore our second question about god names, we created a dataframe filtered for each kind of god (Greek, Norse, Hindu) and then merged them into one dataframe.
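The counting, proportion, and god-name filtering steps described above can be illustrated with a small sketch. This is a Python illustration under stated assumptions, not the project's R code: the row schema, the helper names, the toy rows, and the short lists of god names in `PANTHEONS` are all hypothetical (the actual name lists are not given here).

```python
import re
from collections import Counter, defaultdict

# Hypothetical name lists, for illustration only.
PANTHEONS = {
    "greek": ["zeus", "athena"],
    "norse": ["odin", "thor"],
    "hindu": ["vishnu", "shiva"],
}

def top_categories(rows, n=5):
    """Return the n categories with the most questions."""
    counts = Counter(row["category"] for row in rows)
    return [category for category, _ in counts.most_common(n)]

def yearly_proportions(rows, categories):
    """For each year, the share of that year's questions in each category."""
    totals = Counter(row["year"] for row in rows)
    per_year = defaultdict(Counter)
    for row in rows:
        if row["category"] in categories:
            per_year[row["year"]][row["category"]] += 1
    return {
        year: {c: per_year[year][c] / totals[year] for c in categories}
        for year in sorted(totals)
    }

def mentions_by_pantheon(rows):
    """Filter rows per pantheon, then merge the results into one list."""
    patterns = {
        pantheon: re.compile(r"\b(?:" + "|".join(names) + r")\b")
        for pantheon, names in PANTHEONS.items()
    }
    merged = []
    for row in rows:
        # A god counts as mentioned if the name is in the question or answer.
        text = row["question"] + " " + row["answer"]
        for pantheon, pattern in patterns.items():
            if pattern.search(text):
                merged.append({**row, "pantheon": pantheon})
    return merged

def make_row(year, category, question="placeholder clue", answer="placeholder"):
    """Build one toy row with placeholder text."""
    return {"year": year, "category": category,
            "question": question, "answer": answer}

rows = [
    make_row(1997, "science"),
    make_row(1997, "science"),
    make_row(1997, "sports"),
    make_row(1997, "mythology", question="placeholder clue about zeus"),
    make_row(1999, "science"),
    make_row(1999, "sports", answer="thor"),
]

top2 = top_categories(rows, n=2)
shares = yearly_proportions(rows, top2)
gods = mentions_by_pantheon(rows)
```

Dividing by each year's own total is what makes the proportions comparable across years even when the total number of questions changes.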
Result for Question 1a
| category | before_98 | after_98 |
|---|---|---|
| sports | 574 | 127 |
| science | 553 | 292 |
| history | 513 | 196 |
| literature | 503 | 301 |
| world geography | 484 | 145 |
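
To make the size of the drop concrete, the relative change per category can be computed directly from the counts in the table above. This is a small Python sketch of the arithmetic, not the project's R code, and the variable names are illustrative.

```python
# Counts copied from the table: category -> (before_98, after_98).
counts = {
    "sports": (574, 127),
    "science": (553, 292),
    "history": (513, 196),
    "literature": (503, 301),
    "world geography": (484, 145),
}

def percent_change(before, after):
    """Relative change from before to after, in percent, to one decimal."""
    return round((after - before) / before * 100, 1)

changes = {
    category: percent_change(before, after)
    for category, (before, after) in counts.items()
}

total_before = sum(before for before, _ in counts.values())
total_after = sum(after for _, after in counts.values())
total_change = percent_change(total_before, total_after)

for category, change in changes.items():
    print(f"{category}: {change}%")
print(f"all five combined: {total_change}%")
```

On these counts, sports falls the most (about 78%) and literature the least (about 40%), with the five categories combined falling by about 60%.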